Technote 1181
Sherlock’s Find by Content Text Extractor Plug-ins
By John Montbriand
Apple Worldwide Developer Technical Support
Overview |
This Technote describes the API for creating Find By Content Text Extractor Plug-ins. Text Extractor Plug-ins are used by Find by Content to extract the textual information stored in a document when it is creating indexes and summarizing files. By doing so, it is possible for users to avoid indexing peripheral data such as formatting commands, HTML tags, and other data that does not relate to the information stored in the document. By creating Text Extractor Plug-ins for their document types, developers make it possible for users to conduct meaningful searches for information stored in documents created by their applications.
Text Extractor Plug-ins can be created for use with Mac OS 8.6 and later. Mac OS 8.6 was shipped with two Text Extractor Plug-ins: the “HTML Text Extractor” and the “PDF Text Extractor.” The “HTML Text Extractor” strips the HTML tags from HTML files and returns the text stored therein; the “PDF Text Extractor” returns the textual information from Adobe®’s Portable Document Format (PDF) files. In Mac OS 8.5, indexing HTML files meant that both the text stored in the document and the HTML tags were incorporated into indexes. Furthermore, PDF files were excluded from the indexing process. In Mac OS 8.6, meaningful textual information extracted from these files is incorporated into index files used by Find By Content.
This Technote provides information necessary for creating and installing Text Extractor Plug-ins. In addition, an annotated example Text Extractor Plug-in is provided. Developers can easily modify this example to create their own plug-in for use with their own file formats.
Text Extractor Plug-ins Defined
Text Extractor Plug-ins are Code Fragments that have the following characteristics:
A Text Extractor Plug-in’s resource file may contain one or more
|
Registering the MIME Types a Plug-in can Understand
Clients of Text Extractors need to map documents to a MIME type. To
help clients determine the document types a plug-in understands,
a plug-in can include one or more
Listing 2. A sample 'mimp' resource for PDF files.
When creating indexes, Find By Content uses calls to Internet Config
to discover the file’s MIME type. Once a file’s MIME type has been
discovered, it then uses a Text Extractor Plug-in capable of extracting
text from the file (based on the MIME types the extractor advertises it
can decode in its |
Structures Used By Plug-ins
Find By Content provides a number of routines and callbacks that can be used by Text Extractor Plug-ins. These callbacks provide access to memory allocation and file input. The following sections describe the structures used by Find By Content to provide these callbacks and the callbacks themselves.
Application developers wanting to call Text Extractor Plug-ins from their own code will want to create and initialize these structures themselves. Examples of how to do this can be found later in the Calling a Text Extractor Plug-in from an Application section below.
The IAPluginInitBlock Structure
The
Listing 3. Declaration of the IAPluginInitBlock structure and prototypes that can be used for calling the routines referenced in the structure.
Application developers wanting to call Text Extractor Plug-ins from
inside of their own applications will have to initialize this structure
and define the necessary callbacks themselves. An example showing how to
set up a The
|
/* IADocAccessorRecord structure definition. */
typedef struct IADocAccessorRecord* IADocAccessorPtr;
struct IADocAccessorRecord {
    /* docAccessor is an opaque type used by Find By Content to track
       the file. It is not possible for plug-ins to access this
       information. */
    IADocAccessorRef docAccessor;
    IADocAccessorOpenUPP OpenDoc;
    IADocAccessorCloseUPP CloseDoc;
    IADocAccessorReadUPP ReadDoc;
    IASetDocAccessorReadPositionUPP SetReadPosition;
    IAGetDocAccessorReadPositionUPP GetReadPosition;
    IAGetDocAccessorEOFUPP GetEOF;
};
typedef struct IADocAccessorRecord IADocAccessorRecord;

/* Routine Prototypes. */
OSStatus CallIADocumentAccessorOpen(IADocAccessorRef inAccessor);
OSStatus CallIADocumentAccessorClose(IADocAccessorRef inAccessor);
OSStatus CallIADocumentAccessorRead(IADocAccessorRef inAccessor,
    void* buffer, UInt32* ioSize);
OSStatus CallIASetDocumentAccessorReadPosition(IADocAccessorRef inAccessor,
    SInt32 inMode, SInt32 inOffset);
OSStatus CallIAGetDocumentAccessorReadPosition(IADocAccessorRef inAccessor,
    SInt32* outPostion);
OSStatus CallIAGetDocumentAccessorEOF(IADocAccessorRef inAccessor,
    SInt32* outEOF);

/* macros corresponding to the routine prototypes above */
#define CallIADocumentAccessorOpen(accessor) \
    InvokeIADocAccessorOpenUPP((accessor)->docAccessor, \
        (accessor)->OpenDoc)
#define CallIADocumentAccessorClose(accessor) \
    InvokeIADocAccessorCloseUPP((accessor)->docAccessor, \
        (accessor)->CloseDoc)
#define CallIADocumentAccessorRead(accessor, buffer, size) \
    InvokeIADocAccessorReadUPP((accessor)->docAccessor, (buffer), \
        (size), (accessor)->ReadDoc)
#define CallIASetDocumentAccessorReadPosition(accessor, mode, offset) \
    InvokeIASetDocAccessorReadPositionUPP((accessor)->docAccessor, \
        (mode), (offset), (accessor)->SetReadPosition)
#define CallIAGetDocumentAccessorReadPosition(accessor, outPosition) \
    InvokeIAGetDocAccessorReadPositionUPP((accessor)->docAccessor, \
        (outPosition), (accessor)->GetReadPosition)
/* the original listing was missing the opening parenthesis before
   the final accessor argument */
#define CallIAGetDocumentAccessorEOF(accessor, outEOF) \
    InvokeIAGetDocAccessorEOFUPP((accessor)->docAccessor, \
        (outEOF), (accessor)->GetEOF)
Listing 4. Declaration of the IADocAccessorRecord structure and prototypes that can be used for calling the routines referenced in the structure.
The IADocAccessorRecord defined in Listing 4 provides plug-ins with all the necessary resources for accessing files. Plug-ins should not make calls to the File Manager directly. Instead, they should perform all file input operations necessary for accessing a file through these callbacks. Fields and callbacks defined in this structure are discussed below.
CallIADocumentAccessorOpen
OSStatus CallIADocumentAccessorOpen( IADocAccessorRef inAccessor);
result— |
CallIADocumentAccessorOpen is a callback procedure provided in the IADocAccessorRecord structure that plug-ins call to open a file for input.
Plug-ins should call this routine to open the document for reading before making any of the input calls described below.
CallIADocumentAccessorClose
OSStatus CallIADocumentAccessorClose( IADocAccessorRef inAccessor);
result— |
CallIADocumentAccessorClose is a callback procedure provided in the IADocAccessorRecord structure that plug-ins call to close a file that was opened by a call to CallIADocumentAccessorOpen.
CallIADocumentAccessorRead
OSStatus CallIADocumentAccessorRead( IADocAccessorRef inAccessor, void* buffer, UInt32* ioSize);
result— |
CallIADocumentAccessorRead is a callback procedure provided in the IADocAccessorRecord structure that can be called by plug-ins to read data from a file.
CallIADocumentAccessorRead reads *ioSize bytes from the file starting at the current read position. On return, *ioSize reflects the actual number of bytes read and the routine’s result indicates the success of the call. If this callback returns an eofErr error, be sure to check the value stored in *ioSize, as some bytes may have been read into the buffer before the end of the file was encountered. Calls to CallIADocumentAccessorRead advance the read position for the file past the bytes that have been read, so the next call to CallIADocumentAccessorRead begins where the last one left off.
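The eofErr behavior described above can be sketched with a read loop. This is only an illustration under stated assumptions: SketchDoc, SketchRead, and SketchReadAll are hypothetical names, the in-memory document stands in for the real accessor, and the typedefs replace the Mac OS headers so that the fragment compiles on its own. A real plug-in would call CallIADocumentAccessorRead instead of SketchRead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins so the sketch compiles without the Mac OS headers. */
typedef int32_t  OSStatus;
typedef uint32_t UInt32;
enum { noErr = 0, eofErr = -39 };

/* Hypothetical in-memory "document" standing in for the accessor. */
typedef struct {
    const char *data;
    UInt32      length;
    UInt32      mark;      /* current read position */
} SketchDoc;

/* Mimics the read callback's contract: on entry *ioSize is the number
   of bytes wanted; on return it is the number actually read. eofErr
   may come back together with a non-zero *ioSize. */
static OSStatus SketchRead(SketchDoc *doc, void *buffer, UInt32 *ioSize)
{
    UInt32   avail = doc->length - doc->mark;
    UInt32   want  = *ioSize;
    OSStatus err   = noErr;

    assert(buffer != NULL);
    if (want > avail) { want = avail; err = eofErr; }
    memcpy(buffer, doc->data + doc->mark, want);
    doc->mark += want;
    *ioSize = want;
    return err;
}

/* Reads the whole document in chunkSize pieces, keeping the bytes that
   arrive together with eofErr. Returns the total bytes copied to out. */
static UInt32 SketchReadAll(SketchDoc *doc, char *out, UInt32 chunkSize)
{
    UInt32 total = 0;

    assert(chunkSize > 0);
    for (;;) {
        UInt32   count = chunkSize;
        OSStatus err   = SketchRead(doc, out + total, &count);
        total += count;            /* count is valid even on eofErr */
        if (err != noErr) break;   /* eofErr or a real error ends the loop */
    }
    return total;
}
```

The point of the sketch is the order of operations inside the loop: the byte count is consumed before the result code is examined, so a final short read is not lost.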
CallIASetDocumentAccessorReadPosition
OSStatus CallIASetDocumentAccessorReadPosition( IADocAccessorRef inAccessor, SInt32 inMode, SInt32 inOffset);
result— |
CallIASetDocumentAccessorReadPosition is a callback procedure provided in the IADocAccessorRecord structure that can be called by plug-ins to set the position where the next call to CallIADocumentAccessorRead will begin reading bytes from the file.
The inMode parameter selects whether inOffset is measured from the start of the file, from the current read position, or from the end of the file (kIAFromStartMode, kIAFromCurrMode, and kIAFromEndMode, as used in Listing 17). When a file is first opened, its read position is set to the beginning of the file.
CallIAGetDocumentAccessorReadPosition
OSStatus CallIAGetDocumentAccessorReadPosition( IADocAccessorRef inAccessor, SInt32* outPostion);
result— |
CallIAGetDocumentAccessorReadPosition is a callback procedure provided in the IADocAccessorRecord structure that can be called by plug-ins to determine the position where the next read will take place when CallIADocumentAccessorRead is called.
CallIAGetDocumentAccessorReadPosition returns the location where the next read operation will take place in *outPostion. The value returned is an offset from the beginning of the file.
CallIAGetDocumentAccessorEOF
OSStatus CallIAGetDocumentAccessorEOF( IADocAccessorRef inAccessor, SInt32* outEOF);
result— |
CallIAGetDocumentAccessorEOF is a callback procedure provided in the IADocAccessorRecord structure that can be called by plug-ins to determine the length of the input file.
On return, *outEOF is set to the total number of bytes in the file.
Application developers wanting to call Text Extractor Plug-ins from inside their own applications will have to initialize this structure and define the necessary callbacks themselves. An example showing how to set up an IADocAccessorRecord structure can be found in the Setting up the IADocAccessorRecord structure section later in this document.
Routines a Text Extractor Must Define
This section describes the routines that must be exported by all Text Extractor Plug-ins. It provides a detailed description of each routine along with some discussion of any important issues related to each routine.
IAPluginInit
OSStatus IAPluginInit( IAPluginInitBlockPtr initBlock, IAPluginRef *outPluginRef);
result— |
IAPluginInit is a routine that must be provided in the plug-in’s code fragment.
After the plug-in’s code fragment has been prepared for execution, the plug-in’s IAPluginInit routine is called. This routine provides an opportunity for a plug-in to perform any initialization operations it may require.
The callbacks in the IAPluginInitBlock pointed to by the initBlock parameter remain valid while the plug-in is open (until IAPluginTerm is called) and may be called from any of the plug-in’s other routines. The value stored in *outPluginRef is dedicated for the plug-in’s use and may be used to store persistent state information that is to remain intact between calls to the plug-in (this value is not saved after the plug-in has been closed).
For an example illustrating how this routine could be implemented refer to Listing 6.
IAPluginTerm
OSStatus IAPluginTerm(IAPluginRef inPluginRef);
result— |
IAPluginTerm is a routine that must be provided in the plug-in’s code fragment.
Before a plug-in’s Code Fragment Manager connection is closed, the plug-in’s IAPluginTerm routine is called. This routine provides an opportunity for the plug-in to perform any necessary cleanup operations such as deallocating storage, closing resource files, et cetera. After this routine has been called, there will be no other calls made to the plug-in until the next time it is opened by a call to IAPluginInit.
For an example illustrating how this routine could be implemented refer to Listing 7.
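The pairing of IAPluginInit and IAPluginTerm can be sketched as follows. This is not the technote's own listing. It is a minimal illustration assuming plain C function pointers in place of the UPP fields of the IAPluginInitBlock, and all Sketch names are hypothetical. It shows state being allocated through the host's callback, handed back through the plug-in reference, and released again at termination.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins so the sketch compiles without the Mac OS headers. */
typedef int32_t  OSStatus;
typedef uint32_t UInt32;
enum { noErr = 0, kSketchAllocErr = -1 };   /* invented error code */

/* Plain function pointers stand in for the UPP fields of the real
   IAPluginInitBlock (assumption made for this sketch only). */
typedef struct {
    void *(*Alloc)(UInt32 size);
    void  (*Free)(void *object);
} SketchInitBlock;

/* Persistent plug-in state, returned to the host as the plug-in ref. */
typedef struct SketchPluginState *SketchPluginRef;
struct SketchPluginState {
    const SketchInitBlock *callbacks;   /* valid until termination */
    UInt32                 documentsOpened;
};

static OSStatus SketchPluginInit(const SketchInitBlock *initBlock,
                                 SketchPluginRef *outPluginRef)
{
    struct SketchPluginState *state;

    assert(initBlock != NULL && outPluginRef != NULL);
    state = (struct SketchPluginState *)
                initBlock->Alloc((UInt32) sizeof *state);
    if (state == NULL) return kSketchAllocErr;
    state->callbacks = initBlock;
    state->documentsOpened = 0;
    *outPluginRef = state;
    return noErr;
}

static OSStatus SketchPluginTerm(SketchPluginRef inPluginRef)
{
    /* release the state through the same callbacks that allocated it */
    if (inPluginRef != NULL)
        inPluginRef->callbacks->Free(inPluginRef);
    return noErr;
}

/* Host-side callbacks used only to try the sketch out. */
static void *SketchAlloc(UInt32 size) { return malloc(size); }
static void  SketchFree(void *object) { free(object); }
```

Keeping a pointer to the init block inside the state relies on the guarantee above that the callbacks remain valid until IAPluginTerm is called.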
IAGetExtractorVersion
OSStatus IAGetExtractorVersion( IAPluginRef inPluginRef, UInt32* outPluginVersion);
result— |
IAGetExtractorVersion is a routine that must be provided in the plug-in’s code fragment.
In this routine, a plug-in should set the value *outPluginVersion to the version of the Text Extractor Plug-in interface it was compiled against. The constant kIAExtractorCurrentVersion, defined in “IAExtractor.h,” contains the current version of the Text Extractor Plug-in interface.
For an example illustrating how this routine could be implemented refer to Listing 8.
IACountSupportedDocTypes
OSStatus IACountSupportedDocTypes( IAPluginRef inPluginRef, UInt32* outCount);
result— |
IACountSupportedDocTypes is a routine that must be provided in the plug-in’s code fragment.
This routine should set *outCount to the number of document types the plug-in is able to handle. The value stored in *outCount is interpreted as the maximum valid index that can be passed to IAGetIndSupportedDocType.
For an example illustrating how this routine could be implemented refer to Listing 9.
IAGetIndSupportedDocType
OSStatus IAGetIndSupportedDocType( IAPluginRef inPluginRef, UInt32 inIndex, char** outMIMEType);
result— |
IAGetIndSupportedDocType is a routine that must be provided in the plug-in’s code fragment.
The routine IAGetIndSupportedDocType sets *outMIMEType to point to a string containing the nth MIME type the plug-in is able to understand. Index values that may be provided in the inIndex parameter range from 1 (not zero) through the maximum value as reported by the IACountSupportedDocTypes call.
For an example illustrating how this routine could be implemented refer to Listing 10.
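The one-based indexing shared by IACountSupportedDocTypes and IAGetIndSupportedDocType can be sketched as below. The sketch is an assumption-laden illustration rather than the technote's listing: the plug-in reference parameter is omitted, the Sketch names and kSketchParamErr are hypothetical, and "text/x-sketch" is a made-up MIME type used only to have a second table entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without the Mac OS headers. */
typedef int32_t  OSStatus;
typedef uint32_t UInt32;
enum { noErr = 0, kSketchParamErr = -1 };   /* invented error code */

/* Table of supported types; "text/x-sketch" is a made-up example. */
static const char *const kSketchTypes[] = { "text/plain", "text/x-sketch" };
#define kSketchTypeCount \
    ((UInt32) (sizeof kSketchTypes / sizeof kSketchTypes[0]))

static OSStatus SketchCountSupportedDocTypes(UInt32 *outCount)
{
    assert(outCount != NULL);
    *outCount = kSketchTypeCount;   /* also the largest valid index */
    return noErr;
}

static OSStatus SketchGetIndSupportedDocType(UInt32 inIndex,
                                             const char **outMIMEType)
{
    assert(outMIMEType != NULL);
    /* valid indexes run from 1 through the count, not from zero */
    if (inIndex < 1 || inIndex > kSketchTypeCount)
        return kSketchParamErr;
    *outMIMEType = kSketchTypes[inIndex - 1];
    return noErr;
}
```

A caller would loop from 1 through the reported count, inclusive, to enumerate every type.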
IAOpenDocument
OSStatus IAOpenDocument( IAPluginRef inPluginRef, IADocAccessorPtr inAccessor, IADocRef* outDoc);
result— |
IAOpenDocument is a routine that must be provided in the plug-in’s code fragment.
IAOpenDocument is called before a plug-in is used to extract text from a new document. This routine provides an opportunity for the plug-in to perform any initialization operations required before it begins reading text from a file. Any state variables or data buffers required for processing the file should be stored in a block of memory and a pointer to that block should be stored in *outDoc. This value will be passed to the routines IAGetNextTextRun and IAGetTextRunInfo while the document is open, and then to IACloseDocument once all the required text has been extracted from the document.
Both the IADocAccessorRecord pointed to by the inAccessor parameter and the value stored in *outDoc will remain valid until IACloseDocument is called.
For an example illustrating how this routine could be implemented refer to Listing 11.
IACloseDocument
OSStatus IACloseDocument( IADocRef inDoc);
result— |
IACloseDocument is a routine that must be provided in the plug-in’s code fragment.
IACloseDocument is called after all textual information required has been extracted from the document. In this call, the plug-in should dispose of any state variables or buffers that were created specifically for the file referenced by the inDoc parameter.
For an example illustrating how this routine could be implemented refer to Listing 12.
IAGetNextTextRun
OSStatus IAGetNextTextRun( IADocRef inDoc, void* buffer, UInt32* ioSize);
result— |
IAGetNextTextRun is a routine that must be provided in the plug-in’s code fragment.
The IAGetNextTextRun routine should copy text from the document into the memory buffer pointed to by the buffer parameter until that buffer is full, or the plug-in runs out of text. If the language encoding changes from one language to another while text is being decoded, the plug-in should mark that location in the text stream by returning the result code errIAEndOfTextRun.
For an example illustrating how this routine could be implemented refer to Listing 13.
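One way a plug-in might signal an encoding boundary is sketched below. This is an illustration under assumptions, not the technote's Listing 13: the document is modeled as an array of hypothetical SketchRun records, kSketchEndOfTextRun is a placeholder standing in for errIAEndOfTextRun, and the end of the text is reported by returning zero bytes, which matches how the caller in Listing 18 detects normal termination.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins so the sketch compiles without the Mac OS headers.
   kSketchEndOfTextRun is a placeholder for errIAEndOfTextRun. */
typedef int32_t  OSStatus;
typedef uint32_t UInt32;
enum { noErr = 0, kSketchEndOfTextRun = 1 };

/* Hypothetical document model: consecutive runs of text, each in one
   character encoding. */
typedef struct { const char *text; const char *encoding; } SketchRun;

typedef struct {
    const SketchRun *runs;
    UInt32           runCount;
    UInt32           runIndex;   /* run currently being read */
    UInt32           offset;     /* bytes already returned from it */
} SketchDocState;

static OSStatus SketchGetNextTextRun(SketchDocState *doc, void *buffer,
                                     UInt32 *ioSize)
{
    const SketchRun *run;
    UInt32 runLength, remaining, count;

    assert(doc != NULL && buffer != NULL && ioSize != NULL);
    assert(*ioSize > 0);

    /* out of text: report zero bytes with no error */
    if (doc->runIndex >= doc->runCount) { *ioSize = 0; return noErr; }

    run       = &doc->runs[doc->runIndex];
    runLength = (UInt32) strlen(run->text);
    remaining = runLength - doc->offset;
    count     = (*ioSize < remaining) ? *ioSize : remaining;

    memcpy(buffer, run->text + doc->offset, count);
    doc->offset += count;
    *ioSize = count;

    if (doc->offset == runLength) {
        /* this run is exhausted; move to the next one and, if there is
           one, tell the caller that the encoding is about to change */
        doc->runIndex++;
        doc->offset = 0;
        if (doc->runIndex < doc->runCount) return kSketchEndOfTextRun;
    }
    return noErr;
}
```

Note that the boundary code is returned together with the last bytes of the finished run, so the caller still has to consume the buffer when it sees it.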
IAGetTextRunInfo
OSStatus IAGetTextRunInfo( IADocRef inDoc, char** outEncoding, char** outLanguage);
result— |
IAGetTextRunInfo is a routine that must be provided in the plug-in’s code fragment.
IAGetTextRunInfo returns information about the character encoding and the language of the text for the last buffer returned by IAGetNextTextRun.
Both parameters are optional and may or may not be present depending on the caller’s requirements. If a parameter is not required, then it will be set to NULL.
If the plug-in allocates a pointer to a string and stores that pointer either in *outEncoding or in *outLanguage, then it is the plug-in’s responsibility to deallocate that storage.
If either value is not known, the plug-in may store the value NULL in *outEncoding or in *outLanguage. This value tells the caller that the current character encoding or language is not known by the plug-in.
A pointer to a string containing the Internet name for the character encoding (for example, “iso-8859-1,” “x-mac-roman,” or “euc-jp”) is returned in *outEncoding. ADDITIONAL INFORMATION ON ITS WAY.
For an example illustrating how this routine could be implemented refer to Listing 14.
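The handling of optional parameters and unknown values can be sketched as follows. This is a minimal illustration, assuming a hypothetical SketchRunInfo record that holds whatever the plug-in knows about the last run. The strings are constants owned by the plug-in, so nothing needs to be deallocated afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without the Mac OS headers. */
typedef int32_t OSStatus;
enum { noErr = 0 };

/* Hypothetical record of what is known about the last text run. */
typedef struct {
    const char *encoding;   /* Internet name, or NULL when unknown */
    const char *language;   /* NULL when unknown */
} SketchRunInfo;

static OSStatus SketchGetTextRunInfo(const SketchRunInfo *doc,
                                     const char **outEncoding,
                                     const char **outLanguage)
{
    assert(doc != NULL);
    /* either output pointer may be NULL when the caller does not need
       that piece of information, so each is checked before use */
    if (outEncoding != NULL) *outEncoding = doc->encoding;
    if (outLanguage != NULL) *outLanguage = doc->language;
    return noErr;
}
```

Storing NULL through a non-NULL output pointer is how this sketch reports that a value is not known.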
An Example Plug-in
The following annotated example illustrates how to create a Text Extractor Plug-in for the “text/plain” MIME type. As the function of this plug-in is to pass text from the file to the caller, its implementation is very simple. Developers can easily modify this example to extract text from their own file formats.
Listing 5. File header & imports for Text Extractor Plug-ins. The only important aspect of the above is the header file being included. Here, the file “IAExtractor.h” containing the necessary constant and structure definitions is included.
Listing 6. IAPluginInit example.
The
Listing 7. IAPluginTerm example.
Normally, the
Listing 8. IAGetExtractorVersion example.
The value
Listing 9. IACountSupportedDocTypes example.
In this example, we only support one document type—plain text documents.
Listing 10. IAGetIndSupportedDocType example.
In the above declaration of
Listing 11. IAOpenDocument example.
In the
Listing 12. IACloseDocument example.
In the
Listing 13. IAGetNextTextRun example.
In the
Listing 14. IAGetTextRunInfo example. In this example, we return |
Calling a Text Extractor Plug-in from an Application
Following is an example of how a client may use a Text Extractor Plug-in to extract the text of a document. Applications may use these routines or some variant of them to call Text Extractor Plug-ins to extract text from virtually any document type. The steps below show how to set up the plug-in’s code fragment, set up the callback structures, and finally how to call the plug-in to perform the text extraction. This example does not show how to find or determine the correct plug-in for a particular document.
Setting up a Text Extractor Plug-in
First, we begin by setting up the plug-in’s code fragment for execution and storing pointers to the routines we want to call in a structure we will use to access the plug-in. Listing 15 contains the routines and declarations used to perform this task.
Listing 15. Routines for setting up a Text Extractor Plug-in’s code fragment for execution. The prototypes provided in Listing 15 allow us to call back to the plug-in.
Pointers to these routines are stored in the Setting up the
|
/* routines exported in the IAPluginInitBlock record. Here we have
   defined our own set of routines that call through to the Mac OS
   memory manager. */
static void* MyIAAlloc(UInt32 inSize) {
    return (void*) NewPtr(inSize);
}

static void MyIAFreeProc(void* object) {
    DisposePtr((Ptr) object);
}

static UInt8 MyIAIdleProc(void) {
    return 0;
}

/* NewIAPluginInitBlock allocates a new init block record containing
   memory allocation routines and idle routines that can be called by a
   plug-in. If an error occurs, the function returns NULL. */
IAPluginInitBlockPtr NewIAPluginInitBlock(void) {
    IAPluginInitBlockPtr iBlock;
    iBlock = (IAPluginInitBlockPtr) NewPtrClear(sizeof(IAPluginInitBlock));
    if (iBlock == NULL) goto bail;
    iBlock->Alloc = NewIAAllocProc(MyIAAlloc);
    if (iBlock->Alloc == NULL) goto bail;
    iBlock->Free = NewIAFreeProc(MyIAFreeProc);
    if (iBlock->Free == NULL) goto bail;
    iBlock->Idle = NewIAIdleProc(MyIAIdleProc);
    if (iBlock->Idle == NULL) goto bail;
    return iBlock;
bail:
    if (iBlock != NULL) {
        if (iBlock->Alloc != NULL)
            DisposeRoutineDescriptor((UniversalProcPtr) iBlock->Alloc);
        if (iBlock->Free != NULL)
            DisposeRoutineDescriptor((UniversalProcPtr) iBlock->Free);
        if (iBlock->Idle != NULL)
            DisposeRoutineDescriptor((UniversalProcPtr) iBlock->Idle);
        DisposePtr((Ptr) iBlock);
    }
    return NULL;
}

/* DisposeIAPluginInitBlock releases the memory occupied by the init
   block record allocated in NewIAPluginInitBlock. */
void DisposeIAPluginInitBlock(IAPluginInitBlockPtr iBlock) {
    DisposeRoutineDescriptor((UniversalProcPtr) iBlock->Alloc);
    DisposeRoutineDescriptor((UniversalProcPtr) iBlock->Free);
    DisposeRoutineDescriptor((UniversalProcPtr) iBlock->Idle);
    DisposePtr((Ptr) iBlock);
}
IAPluginInitBlock
structure.
The routines provided in Listing 16 allocate and deallocate an IAPluginInitBlock structure whose callbacks call through to the Memory Manager.
Setting up the IADocAccessorRecord structure
The routines and declarations provided in Listing 17 illustrate how to set up the file access callbacks for a plug-in. Here, we allocate the callback structure and another structure for keeping track of the file itself.
/* MyDocumentReference contains information used by the caller to track
   the input source being used by the plug-in. In this example, we are
   using a Mac OS file. */
typedef struct {
    FSSpec spec;      /* a copy of the file specification record */
    Boolean docOpen;  /* true when document is open */
    short refnum;     /* file reference number */
} MyDocumentReference, *MyDocRefPtr;

/* in this example, we will fill the fields of the IADocAccessorRecord
   with routine descriptors referring to routines that call through to
   the Mac OS file system. These routines are defined below. The macros
   in Listing 4 pass the value of the docAccessor field as inAccessor,
   so each routine casts inAccessor directly to our MyDocRefPtr. */
static OSStatus MyIADocAccessorOpenProc(IADocAccessorRef inAccessor) {
    MyDocRefPtr refptr;
    OSErr err;
    refptr = (MyDocRefPtr) inAccessor;
    err = FSpOpenDF(&refptr->spec, fsRdPerm, &refptr->refnum);
    if (err == noErr) refptr->docOpen = true;
    return (OSStatus) err;
}

static OSStatus MyIADocAccessorCloseProc(IADocAccessorRef inAccessor) {
    MyDocRefPtr refptr;
    refptr = (MyDocRefPtr) inAccessor;
    if ( ! refptr->docOpen) return errIAParamErr;
    FSClose(refptr->refnum);
    refptr->docOpen = false;
    return errIANoErr;
}

static OSStatus MyIADocAccessorReadProc(IADocAccessorRef inAccessor,
        void* buffer, UInt32* ioSize) {
    MyDocRefPtr refptr;
    OSErr err;
    refptr = (MyDocRefPtr) inAccessor;
    if ( ! refptr->docOpen) return errIAParamErr;
    err = FSRead(refptr->refnum, (long*) ioSize, buffer);
    return (OSStatus) err;
}

static OSStatus MyIASetDocAccessorReadPositionProc(
        IADocAccessorRef inAccessor, SInt32 inMode, SInt32 inOffset) {
    MyDocRefPtr refptr;
    OSErr err;
    refptr = (MyDocRefPtr) inAccessor;
    if ( ! refptr->docOpen) return errIAParamErr;
    switch (inMode) {
        case kIAFromStartMode:
            err = SetFPos(refptr->refnum, fsFromStart, inOffset);
            break;
        case kIAFromCurrMode:
            err = SetFPos(refptr->refnum, fsFromMark, inOffset);
            break;
        case kIAFromEndMode:
            err = SetFPos(refptr->refnum, fsFromLEOF, inOffset);
            break;
        default:
            err = errIAParamErr;
            break;
    }
    return (OSStatus) err;
}

static OSStatus MyIAGetDocAccessorReadPositionProc(
        IADocAccessorRef inAccessor, SInt32* outPostion) {
    MyDocRefPtr refptr;
    OSErr err;
    refptr = (MyDocRefPtr) inAccessor;
    if ( ! refptr->docOpen) return errIAParamErr;
    err = GetFPos(refptr->refnum, outPostion);
    return (OSStatus) err;
}

static OSStatus MyIAGetDocAccessorEOFProc(
        IADocAccessorRef inAccessor, SInt32* outEOF) {
    MyDocRefPtr refptr;
    OSErr err;
    refptr = (MyDocRefPtr) inAccessor;
    if ( ! refptr->docOpen) return errIAParamErr;
    err = GetEOF(refptr->refnum, outEOF);
    return (OSStatus) err;
}

/* NewIADocAccessorRec initializes an IADocAccessorRecord with routine
   descriptors referring to routines that call through to the Mac OS
   file system. It stores a record containing information about the
   file in the docAccessor field of the IADocAccessorRecord record. If
   an error occurs, the function returns NULL. */
IADocAccessorPtr NewIADocAccessorRec(FSSpec *targetFile) {
    IADocAccessorPtr docAcc;
    MyDocRefPtr refptr;
    docAcc = NULL;
    refptr = NULL;
    refptr = (MyDocRefPtr) NewPtr(sizeof(MyDocumentReference));
    if (refptr == NULL) goto bail;
    refptr->spec = *targetFile;
    refptr->docOpen = false;
    refptr->refnum = 0;
    docAcc = (IADocAccessorPtr) NewPtrClear(sizeof(IADocAccessorRecord));
    if (docAcc == NULL) goto bail;
    docAcc->docAccessor = (IADocAccessorRef) refptr;
    docAcc->OpenDoc = NewIADocAccessorOpenProc(MyIADocAccessorOpenProc);
    if (docAcc->OpenDoc == NULL) goto bail;
    docAcc->CloseDoc = NewIADocAccessorCloseProc(MyIADocAccessorCloseProc);
    if (docAcc->CloseDoc == NULL) goto bail;
    docAcc->ReadDoc = NewIADocAccessorReadProc(MyIADocAccessorReadProc);
    if (docAcc->ReadDoc == NULL) goto bail;
    docAcc->SetReadPosition = NewIASetDocAccessorReadPositionProc(
        MyIASetDocAccessorReadPositionProc);
    if (docAcc->SetReadPosition == NULL) goto bail;
    docAcc->GetReadPosition = NewIAGetDocAccessorReadPositionProc(
        MyIAGetDocAccessorReadPositionProc);
    if (docAcc->GetReadPosition == NULL) goto bail;
    docAcc->GetEOF = NewIAGetDocAccessorEOFProc(
        MyIAGetDocAccessorEOFProc);
    if (docAcc->GetEOF == NULL) goto bail;
    return docAcc;
bail:
    if (refptr != NULL) DisposePtr((Ptr) refptr);
    if (docAcc != NULL) {
        if (docAcc->OpenDoc != NULL)
            DisposeRoutineDescriptor((UniversalProcPtr) docAcc->OpenDoc);
        if (docAcc->CloseDoc != NULL)
            DisposeRoutineDescriptor((UniversalProcPtr) docAcc->CloseDoc);
        if (docAcc->ReadDoc != NULL)
            DisposeRoutineDescriptor((UniversalProcPtr) docAcc->ReadDoc);
        if (docAcc->SetReadPosition != NULL)
            DisposeRoutineDescriptor(
                (UniversalProcPtr) docAcc->SetReadPosition);
        if (docAcc->GetReadPosition != NULL)
            DisposeRoutineDescriptor(
                (UniversalProcPtr) docAcc->GetReadPosition);
        if (docAcc->GetEOF != NULL)
            DisposeRoutineDescriptor((UniversalProcPtr) docAcc->GetEOF);
        DisposePtr((Ptr) docAcc);
    }
    return NULL;
}

/* DisposeIADocAccessorRec releases an IADocAccessorRecord allocated by
   NewIADocAccessorRec. All of the sub fields are deallocated, and, if
   the file is open, it is closed before the structure is deallocated. */
void DisposeIADocAccessorRec(IADocAccessorPtr docAcc) {
    MyDocRefPtr refptr;
    /* destroy the document reference */
    refptr = (MyDocRefPtr) docAcc->docAccessor;
    /* make sure the file is closed, in case we're aborting */
    if (refptr->docOpen) FSClose(refptr->refnum);
    DisposePtr((Ptr) refptr);
    /* release the accessor structure */
    DisposeRoutineDescriptor((UniversalProcPtr) docAcc->OpenDoc);
    DisposeRoutineDescriptor((UniversalProcPtr) docAcc->CloseDoc);
    DisposeRoutineDescriptor((UniversalProcPtr) docAcc->ReadDoc);
    DisposeRoutineDescriptor((UniversalProcPtr) docAcc->SetReadPosition);
    DisposeRoutineDescriptor((UniversalProcPtr) docAcc->GetReadPosition);
    DisposeRoutineDescriptor((UniversalProcPtr) docAcc->GetEOF);
    DisposePtr((Ptr) docAcc);
}
IADocAccessorRecord
.
In Listing 17, we use File Manager calls to access the file. For tracking information used by the File Manager, we store a pointer to a private structure containing that information in the docAccessor field of the IADocAccessorRecord.
The routine provided in Listing 18 calls the Text Extractor Plug-in to gather textual information from a file. The text gathered from the file is passed back to the caller through a routine the caller provides as a parameter.
/* kETBufferSize determines the size of the buffer allocated for
   retrieving chunks of text. */
#define kETBufferSize (1024*1)

/* TextSinkProc is a call back routine provided by the caller. Text
   will be passed to this routine as it is extracted from the file. */
typedef OSErr (*TextSinkProc)(void* text, long length, long refcon);

/* ExtractTextFromFile calls the Text Extractor Plug-in referred to by
   *theExtractor to extract text from the file referred to by
   *targetFile. While extracting text, the text will be sent to the
   TextSinkProc provided by the textsink parameter. refcon is a value
   passed through to the TextSinkProc in its refcon parameter. */
OSErr ExtractTextFromFile(FSSpec *targetFile, FSSpec *theExtractor,
        TextSinkProc textsink, long refcon) {
    ExtractorRecPtr extractor;
    IAPluginInitBlockPtr initblock;
    IADocAccessorPtr accRec;
    IAPluginRef pluginRef;
    UInt32 pluginVersion;
    Boolean exInited, docOpen;
    IADocRef docRef;
    Ptr etBuffer;
    UInt32 bytecount;
    OSErr err;

    /* set up locals to a known state */
    extractor = NULL;
    initblock = NULL;
    accRec = NULL;
    exInited = false;
    docOpen = false;
    etBuffer = NULL;
    err = noErr;

    /* initialize the plug-in. OpenExtractor and the two New... helpers
       report failure only by returning NULL, so memFullErr is assumed
       as the error code in those cases. */
    extractor = OpenExtractor(theExtractor);
    if (extractor == NULL) { err = memFullErr; goto bail; }

    /* initialize the callbacks used by the plug-in for basic memory
       tasks. */
    initblock = NewIAPluginInitBlock();
    if (initblock == NULL) { err = memFullErr; goto bail; }

    /* call the plug-in’s initialization routine. */
    err = extractor->PluginInit(initblock, &pluginRef);
    if (err != noErr) goto bail;
    exInited = true;

    /* query the plug-in to find out whether the interface we're using
       is in sync with the interface it was built to use. */
    err = extractor->GetExtractorVersion(pluginRef, &pluginVersion);
    if (err != noErr) goto bail;
    if (pluginVersion != kIAExtractorVersion1) {
        err = errIAParamErr;
        goto bail;
    }

    /* initialize the callbacks used by the plug-in for file input with
       our document. */
    accRec = NewIADocAccessorRec(targetFile);
    if (accRec == NULL) { err = memFullErr; goto bail; }

    /* allocate a memory buffer for reading */
    etBuffer = NewPtr(kETBufferSize);
    if (etBuffer == NULL) { err = memFullErr; goto bail; }

    /* call the plug-in and ask it to open the document for input. */
    err = extractor->OpenDocument(pluginRef, accRec, &docRef);
    if (err != noErr) goto bail;
    docOpen = true;

    /* Here, we loop until the plug-in returns no more bytes */
    while (true) {
        /* attempt to fill the entire buffer with text. */
        bytecount = kETBufferSize;
        err = extractor->GetNextTextRun(docRef, etBuffer, &bytecount);

        /* if some other error occurs, such as eofErr... we exit...
           errIAEndOfTextRun is not a failure and is handled below. */
        if (err != noErr && err != errIAEndOfTextRun) goto bail;

        /* errIAEndOfTextRun is returned when the language encoding
           changes. in this case, we do nothing, but in some cases we
           may wish to do some additional processing. */
        if (err == errIAEndOfTextRun) {
            /* we don't check the bytecount here because conceivably
               errIAEndOfTextRun could be returned with a zero sized
               buffer simply to indicate the beginning of a new
               character encoding range in cases where the last call
               read all of the characters from the last encoding run. */

        /* normal termination occurs when zero bytes are returned. */
        } else if (bytecount == 0) break;

        /* at this point, we have a chunk of text from the document.
           Here, we pass it back to the caller’s sink. */
        err = textsink(etBuffer, bytecount, refcon);
        if (err != noErr) goto bail;
    }

    /* at this point, all of the text in the document has been read.
       Now, we close down the document by asking the plug-in to close,
       disposing of the memory buffer, and then disposing the file
       input callback structure. DisposeIADocAccessorRec is defined in
       Listing 17. */
    extractor->CloseDocument(docRef);
    docOpen = false;
    DisposePtr(etBuffer);
    etBuffer = NULL;
    DisposeIADocAccessorRec(accRec);
    accRec = NULL;

    /* After closing the document, the plug-in is released. This is
       done by calling the plug-in’s termination procedure, releasing
       the memory allocation callbacks (DisposeIAPluginInitBlock is
       defined in Listing 16) and then releasing the plug-in’s code
       fragment (CloseExtractor is defined in Listing 15). */
    extractor->PluginTerm(pluginRef);
    exInited = false;
    DisposeIAPluginInitBlock(initblock);
    initblock = NULL;
    CloseExtractor(extractor);
    extractor = NULL;

    /* return success */
    return noErr;

bail:
    /* error handling code. note, ordering of the recovery statements
       is important. */
    if (docOpen) extractor->CloseDocument(docRef);
    if (etBuffer != NULL) DisposePtr(etBuffer);
    if (accRec != NULL) DisposeIADocAccessorRec(accRec);
    if (exInited) extractor->PluginTerm(pluginRef);
    if (initblock != NULL) DisposeIAPluginInitBlock(initblock);
    if (extractor != NULL) CloseExtractor(extractor);
    return err;
}
The routine provided in Listing 18 performs the actual text extraction by calling the plug-in’s routines directly. In this example, no attention is paid to the language encoding or character encoding, but this example could easily be modified to return this information. This routine uses structures and calls routines defined in Listing 15, Listing 16, and Listing 17.
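A caller-supplied text sink for a routine like the one in Listing 18 might look like the following sketch. The TextSinkProc typedef is the one from Listing 18. Everything else is an assumption made for illustration: SketchCollector, SketchCollectSink, SketchFeed, and kSketchSinkFull are hypothetical names, SketchFeed merely imitates the extraction loop by handing out chunks of a C string, and the refcon is assumed to be wide enough to carry a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins so the sketch compiles without the Mac OS headers. */
typedef int16_t OSErr;
enum { noErr = 0, kSketchSinkFull = -1 };   /* invented error code */

/* Same shape as the TextSinkProc typedef in Listing 18. */
typedef OSErr (*TextSinkProc)(void *text, long length, long refcon);

/* Hypothetical collector that the refcon points at. */
typedef struct { char bytes[64]; long used; } SketchCollector;

/* Appends each chunk of extracted text to the collector. */
static OSErr SketchCollectSink(void *text, long length, long refcon)
{
    SketchCollector *c;

    /* the sketch assumes a long can carry a pointer */
    assert(sizeof(long) >= sizeof(void *));
    c = (SketchCollector *) refcon;
    if (length < 0 || length > (long) sizeof(c->bytes) - c->used)
        return kSketchSinkFull;
    memcpy(c->bytes + c->used, text, (size_t) length);
    c->used += length;
    return noErr;
}

/* Imitates the extraction loop: hands the text to the sink in pieces
   of at most chunk bytes and stops on the first sink error. */
static OSErr SketchFeed(TextSinkProc sink, long refcon,
                        const char *text, long chunk)
{
    long len = (long) strlen(text);
    long pos = 0;

    assert(chunk > 0);
    while (pos < len) {
        long  n   = (len - pos < chunk) ? len - pos : chunk;
        OSErr err = sink((void *) (text + pos), n, refcon);
        if (err != noErr) return err;
        pos += n;
    }
    return noErr;
}
```

Returning an error from the sink gives the caller a way to stop extraction early, since the loop in Listing 18 bails out as soon as the sink reports a failure.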
Index of Code Listings
The following code listings are provided in this document. Listings 5 through 14 define the content of the sample plug-in, and listings 15 through 18 illustrate how to call a plug-in from an application.
|
|
Thanks to the usual suspects.